PCNN: Projection Convolutional Neural Networks
\[
\frac{\partial L_P}{\partial C_i^l} = \lambda \sum_j^J \left\{ \left[ W_j^l \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) - \hat{C}_{i,j}^l \right] \circ W_j^l \right\},
\tag{3.52}
\]
where $\mathbf{1}$ is the indicator function [199], widely used to estimate the gradient of a nondifferentiable function; its output is 1 only if the condition is satisfied and 0 otherwise.

Updating $W_j^l$: Likewise, the gradient of the projection parameter, $\delta_{W_j^l}$, consists of the following two parts:
\[
\delta_{W_j^l} = \frac{\partial L}{\partial W_j^l} = \frac{\partial L_S}{\partial W_j^l} + \frac{\partial L_P}{\partial W_j^l},
\tag{3.53}
\]
\[
W_j^l \leftarrow W_j^l - \eta_2 \delta_{W_j^l},
\tag{3.54}
\]
where $\eta_2$ is the learning rate for $W_j^l$. We also have the following:
\[
\frac{\partial L_S}{\partial W_j^l}
= \sum_h^J \frac{\partial L_S}{\partial \left[ W_j^l \right]_h}
= \sum_h^J \sum_i^I \left[ \frac{\partial L_S}{\partial \hat{C}_{i,j}^l}
\frac{\partial P_{\Omega_N}^{l,j} ( W_j^l, C_i^l )}{\partial ( W_j^l \circ C_i^l )}
\frac{\partial ( W_j^l \circ C_i^l )}{\partial \left[ W_j^l \right]_h} \right]
= \sum_h^J \sum_i^I \left[ \frac{\partial L_S}{\partial \hat{C}_{i,j}^l}
\circ \mathbf{1}_{-1 \le W_j^l \circ C_i^l \le 1} \circ C_i^l \right]_h,
\tag{3.55}
\]
\[
\frac{\partial L_P}{\partial W_j^l}
= \lambda \sum_h^J \sum_i^I \left[ \left( W_j^l \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) - \hat{C}_{i,j}^l \right) \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) \right]_h,
\tag{3.56}
\]
where $h$ indicates the $h$th plane of the tensor along the channels. This shows that the proposed algorithm can be trained end to end; we summarize the training procedure in Algorithm 13. In the implementation, we use the mean of $W$ in the forward process but keep the original $W$ in the backward propagation.
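The indicator-function gradient of Eq. (3.55) and the mean-of-$W$ treatment can be sketched roughly as follows. This is not the authors' code: the one-dimensional tensors, the upstream gradient, and the function name are placeholder assumptions standing in for one plane of $W_j^l$ and $C_i^l$.

```python
import numpy as np

def ste_indicator_grad(grad_output, w, c):
    # Straight-through-style backward pass through the projection:
    # the gradient is kept only where -1 <= W o C <= 1 (the indicator
    # function in Eq. (3.55)) and zeroed elsewhere; the surviving
    # entries are scaled by C, the other factor of the product W o C.
    mask = (np.abs(w * c) <= 1.0).astype(grad_output.dtype)
    return grad_output * mask * c  # gradient w.r.t. W

# Hypothetical values for one plane of W_j^l and C_i^l.
w = np.array([0.5, 2.0, -1.5])
c = np.array([0.2, 0.1, 0.9])

# Forward pass uses the mean of W; backward keeps the original W.
w_forward = np.full_like(w, w.mean())
grad_out = np.ones_like(w)  # hypothetical upstream gradient
grad_w = ste_indicator_grad(grad_out, w, c)
```

Here the third entry of `grad_w` is zeroed because $|W \circ C| > 1$ there, while `w_forward` carries the layer mean used only in the forward computation.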
Note that in PCNNs for BNNs, we set $U = 2$ and $a_2 = -a_1$. Two binarization processes are used in PCNNs. The first is kernel binarization, performed via the projection onto $\Omega_N$, whose elements are calculated from the mean absolute value of all full-precision kernels per layer [199] as
\[
\frac{1}{I} \sum_i^I \left\| C_i^l \right\|_1,
\tag{3.57}
\]
where $I$ is the total number of kernels.
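A minimal sketch of Eq. (3.57) as written, i.e., the mean over the $I$ kernels of their $L_1$ norms; the function name and kernel values are hypothetical:

```python
import numpy as np

def omega_elements_scale(kernels):
    # Mean of the L1 norms of all I full-precision kernels in a layer,
    # used to set the elements of the discrete set Omega_N (Eq. (3.57)).
    return np.mean([np.abs(k).sum() for k in kernels])

# Two hypothetical 2x2 full-precision kernels (I = 2).
kernels = [np.array([[1.0, -1.0], [0.5, 0.5]]),
           np.array([[2.0, 0.0], [-1.0, 1.0]])]
scale = omega_elements_scale(kernels)  # (3.0 + 4.0) / 2 = 3.5
```

The resulting scalar is computed per layer and shared by all kernels of that layer.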
3.5.7 Progressive Optimization
Training 1-bit CNNs is a highly nonconvex optimization problem, and the initialization state significantly affects convergence. Unlike the method in [159], which initializes the 1-bit CNN models from a real-valued CNN model with the clip function pre-trained on ImageNet, we propose a progressive optimization strategy for training 1-bit CNNs. Although a real-valued CNN model can achieve high classification accuracy, its converged state can differ from that of a 1-bit CNN and may therefore mislead the convergence process of 1-bit CNNs.